AITopics | image and text

Collaborating Authors

image and text

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Neural Information Processing SystemsApr-25-2026, 21:26:27 GMT

Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method does not require bounding box annotations nor high-resolution images.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Fine-Grained Semantically Aligned Vision-Language Pre-Training

Neural Information Processing SystemsApr-25-2026, 08:28:37 GMT

Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts, or advanced cross-modal attention upon image and text features. However, they fail to explicitly learn the fine-grained semantic alignment between visual regions and textual phrases, as only global image-text alignment information is available. In this paper, we introduce LOUPE, a fine-grained semantically aLigned visiOnlangUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions. To efficiently compute the game-theoretic interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Experiments show that LOUPE achieves stateof-the-art performance on a variety of vision-language tasks. Furthermore, without any object-level human annotations and fine-tuning, LOUPE achieves competitive performance on object detection and visual grounding. More importantly, LOUPE opens a new promising direction of learning fine-grained semantics from largescale raw image-text pairs. The repository of this work is at https://github.

artificial intelligence, interaction, machine learning, (12 more...)

Neural Information Processing Systems

Genre: Instructional Material (0.34)

Industry: Leisure & Entertainment > Games (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Overleaf Example

Neural Information Processing SystemsFeb-16-2026, 07:25:07 GMT

Such a lack of alignment and uniformity might restrict the transferability and robustness of embeddings. To this end, we devise a new fine-tuning method for robust representation equipping better alignment and uniformity. First, we propose a Geodesic Multi-Modal Mixup that mixes the embeddings of image and text to generate hard negative samples on the hypersphere.

artificial intelligence, machine learning, natural language, (15 more...)

Neural Information Processing Systems

Country:

Asia > South Korea > Seoul > Seoul (0.40)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

d46662aa53e78a62afd980a29e0c37ed-Paper-Conference.pdf

Neural Information Processing SystemsFeb-12-2026, 03:19:10 GMT

encoder, image-text pair, representation, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
North America > Canada > British Columbia > Vancouver (0.04)
(13 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning

Neural Information Processing SystemsFeb-9-2026, 17:35:54 GMT

During optimization, contrastive learning keeps the different modalities separated by a certain distance, which is influenced by the temperature parameter in the loss function. Our experiments further demonstrate that varying the modality gap distance has a significant impact in improving the model's downstream zero-shot classification performance and fairness.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: North America > United States > California > Santa Clara County > Palo Alto (0.05)

Genre: Research Report > Experimental Study (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)

Add feedback

A Training details

Neural Information Processing SystemsFeb-8-2026, 12:07:04 GMT

Models were trained with 32 experts, with experts placed every 2 layers - except where explicitly stated. The learned contrastive temperature parameter is initialised at 10. We train models at batch size 16,384 for 781,250 steps at resolution 224. These are B/16 models trained for 100,000 steps at batch size 8192. The default training data is mixed with data from JFT -4B with a ratio of 3:1.

artificial intelligence, machine learning, text token, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

theLanguage

Neural Information Processing SystemsFeb-8-2026, 12:07:00 GMT

However, such models are typically trained on a single modality at a time.

artificial intelligence, machine learning, modality, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Florida > Miami-Dade County > Miami (0.04)
(3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

3379ce104189b72d5f7baaa03ae81329-Paper-Conference.pdf

Neural Information Processing SystemsFeb-8-2026, 06:26:53 GMT

image-text matching, knowledge, representation, (12 more...)

Neural Information Processing Systems

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.75)
Information Technology > Artificial Intelligence > Vision (0.70)

Add feedback

Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation

Neural Information Processing SystemsDec-27-2025, 03:02:30 GMT

We present a novel alignment-before-generation approach to tackle the challenging task of generating general 3D shapes based on 2D images or texts. Directly learning a conditional generative model from images or texts to 3D shapes is prone to producing inconsistent results with the conditions because 3D shapes have an additional dimension whose distribution significantly differs from that of 2D images and texts. To bridge the domain gap among the three modalities and facilitate multi-modal-conditioned 3D shape generation, we explore representing 3D shapes in a shape-image-text-aligned space. Our framework comprises two models: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model (ASLDM). The former model encodes the 3D shapes into the shape latent space aligned to the image and text and reconstructs the fine-grained 3D neural fields corresponding to given shape embeddings via the transformer-based decoder. The latter model learns a probabilistic mapping function from the image or text space to the latent shape space. Our extensive experiments demonstrate that our proposed approach can generate higher-quality and more diverse 3D shapes that better semantically conform to the visual or textural conditional inputs, validating the effectiveness of the shape-image-text-aligned space for cross-modality 3D shape generation.

michelangelo, shape generation, shape-image-text aligned latent representation, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.59)

Add feedback

Filters

Collaborating Authors

image and text

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Fine-Grained Semantically Aligned Vision-Language Pre-Training

Overleaf Example

d46662aa53e78a62afd980a29e0c37ed-Paper-Conference.pdf

Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning

A Training details

theLanguage

3379ce104189b72d5f7baaa03ae81329-Paper-Conference.pdf

2fb4be70fc9668e9ec2c71b34fb127d4-Paper-Conference.pdf

Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation